Abstract

Klaassen [11] proposed a method for the detection of data manipulation given the means and standard
deviations for the cells of a one-way ANOVA design. This comment critically reviews this method. In
addition, inspired by this analysis, an alternative approach to test sample correlations over several
experiments is derived. The results are in close agreement with the initial analysis reported by an
anonymous whistleblower [1]. Importantly, the statistic requires several similar experiments; a test for
correlations between 3 sample means based on a single experiment must be considered unreliable.


1 Introduction 

An analysis of means and standard deviations mi- culled from a series of scientific publications, 
led to a request for retraction of a subset of the papers m- The analysis was based on a method 
reported in Klaassen [11], aimed at detecting a type of data manipulation that causes correlations
between condition means of samples that are assumed to be independent. Specifically, given a
one-way balanced ANOVA design with 3 conditions, the means $X_i$, $i = 1, \ldots, 3$, obtained by
averaging over the scores of $n$ different subjects in each condition, are samples of a 3-dimensional
normal distribution

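$$
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim \mathcal{N}\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix},\;
n^{-1} \begin{pmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho_1 & \sigma_1\sigma_3\rho_2 \\
\sigma_1\sigma_2\rho_1 & \sigma_2^2 & \sigma_2\sigma_3\rho_3 \\
\sigma_1\sigma_3\rho_2 & \sigma_2\sigma_3\rho_3 & \sigma_3^2
\end{pmatrix} \right), \tag{1}
$$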

where $\mu_i$ are the unknown true expected values, $\sigma_i$ the unknown sample standard deviations
of the scores under the respective conditions and $\rho_i$ their correlations. The ANOVA assumes
independence between the samples of the conditions, such that $\rho_i = 0$. Indeed, given only
samples of $X_i$ and estimates of $\sigma_i$, the sample correlations $\rho_i$ are not directly accessible.

An anonymous whistleblower pointed out [1] that the results in the studies under suspicion
(i.e. [6], compare Figure 1) show a super-linear pattern which appears too good to be true.
Importantly, the authors of the original publications did not necessarily expect such patterns of
equidistant means; they expected an ordinal, not a linear relation between the three condition
means. Nevertheless, the reanalyses were carried out under the assumption of an expected strict
linear relation between means. The reason was that this strict assumption is conservative with
respect to an inference of data manipulation.¹

Fig. 1: Condition means ($x_1$, $x_2$ and $x_3$) and standard deviations for the 12 experiments reported
in [6]. The condition means $x_1$ and $x_3$ have been connected by a line to visualize the
deviance from a perfect linear behavior of the condition means.

Under the assumption of a strictly linear relationship between the group means, $\mu_i = \alpha + \beta \cdot i$,
the scores can be described as $X_i = \alpha + \beta \cdot i + \epsilon_i$, which implies that
$0 = E[Z] = E[X_1 - 2X_2 + X_3] = \mu_1 - 2\mu_2 + \mu_3$. This linear combination of sample means
$X_i$ yields a new random variable $Z$ with the (univariate) normal distribution
$Z \sim \mathcal{N}(0, n^{-1}\sigma_Z^2(\vec{\sigma}, \vec{\rho}))$, where

$$
\sigma_Z^2(\vec{\sigma}, \vec{\rho}) = \sigma_1^2 + 4\sigma_2^2 + \sigma_3^2
- 4\sigma_1\sigma_2\rho_1 - 4\sigma_2\sigma_3\rho_3 + 2\sigma_1\sigma_3\rho_2 .
$$

Note that the random variable $Z$ can be seen as the deviance from the strictly linear behavior
$\alpha + \beta \cdot i$.
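
The variance of $Z$ can be checked numerically. The following minimal Python sketch is an illustration
only: the function names are hypothetical and not part of the original analysis. It evaluates the
closed-form expression for $\sigma_Z^2$ and compares it with the quadratic form $c^T \Sigma c$ for
$c = (1, -2, 1)$, using the covariance structure of eq. (1) without the factor $n^{-1}$.

```python
def sigma_z_squared(sigma, rho):
    """Closed-form variance factor of Z = X1 - 2*X2 + X3 (before dividing by n)."""
    s1, s2, s3 = sigma
    r1, r2, r3 = rho  # r1: conditions (1,2), r2: conditions (1,3), r3: conditions (2,3)
    return (s1**2 + 4 * s2**2 + s3**2
            - 4 * s1 * s2 * r1 - 4 * s2 * s3 * r3 + 2 * s1 * s3 * r2)


def sigma_z_squared_from_cov(sigma, rho):
    """Same quantity computed as the quadratic form c' Sigma c with c = (1, -2, 1)."""
    s1, s2, s3 = sigma
    r1, r2, r3 = rho
    cov = [[s1 * s1,      s1 * s2 * r1, s1 * s3 * r2],
           [s1 * s2 * r1, s2 * s2,      s2 * s3 * r3],
           [s1 * s3 * r2, s2 * s3 * r3, s3 * s3]]
    c = [1.0, -2.0, 1.0]
    return sum(c[i] * cov[i][j] * c[j] for i in range(3) for j in range(3))


if __name__ == "__main__":
    sigma, rho = (1.0, 2.0, 3.0), (0.1, 0.2, 0.3)
    print(sigma_z_squared(sigma, rho), sigma_z_squared_from_cov(sigma, rho))
```

Positive $\rho_1$ and $\rho_3$ reduce the variance of $Z$, whereas a positive $\rho_2$ increases it.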

Introducing correlations between the samples increases or decreases the variance of $Z$. Klaassen
[11] assumes that a plausible data manipulation (e.g., adjusting the mean of the middle sample
towards the mean of the means of the lower and upper samples to achieve significant differences
between the groups) leads to a decrease of the variance of $Z$, $\sigma_Z^2(\cdot, \cdot)$. Such a variance
reduction may have gone unnoticed as humans tend to underestimate variance in data. As mentioned
above, the results under suspicion show a super-linear behavior and hence a small variance in $Z$,
which may not be expected given the group variances $\sigma_i^2$ under the assumption of independence.

Consequently, Klaassen [11] used a simple likelihood-ratio test to decide whether there is
evidence for data manipulation, in terms of an evidential value

$$
V = \frac{\max_{\vec{\rho} \in \mathcal{F}} f(z \mid \sigma_Z(\vec{\sigma}, \vec{\rho}))}
         {f(z \mid \sigma_Z(\vec{\sigma}, \vec{0}))},
$$

comparing the maximum likelihood over all feasible vectors of correlations $\vec{\rho}$ with the
likelihood of $z$ under the assumption of $\vec{\rho} = \vec{0}$, where

$$
\mathcal{F} = \left\{ \vec{\rho} : \rho_i \in (-1, 1),\;
\rho_1^2 + \rho_2^2 + \rho_3^2 - 2\rho_1\rho_2\rho_3 < 1,\;
\sigma_Z(\vec{\sigma}, \vec{\rho}) \le \sigma_Z(\vec{\sigma}, \vec{0}) \right\}
$$

is the set of feasible correlation vectors. It maintains that the covariance matrix (in eq. 1)
remains positive definite and ensures that $\sigma_Z(\vec{\sigma}, \vec{\rho}) \le \sigma_Z(\vec{\sigma}, \vec{0})$
for all $\vec{\rho} \in \mathcal{F}$. As the true sample standard deviations $\vec{\sigma}$ are unknown, they
might be replaced by the reported ones $\vec{s}$, since $\vec{s} \to \vec{\sigma}$ as $n \to \infty$.

¹ It is also not clear how a suitable test could be constructed for the assumption that the means are expected
only in a monotonic, not necessarily equidistant order.
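
To make the definition of $V$ concrete, the following Python sketch approximates the maximization over
the feasible set by a coarse grid search over the three correlations. This is an illustrative
assumption, not the optimization procedure used in [11]; the function names, the grid resolution
`steps` and the example inputs are hypothetical.

```python
import math


def sigma_z(sigma, rho):
    """Standard deviation factor of Z = X1 - 2*X2 + X3 (before the 1/sqrt(n) scaling)."""
    s1, s2, s3 = sigma
    r1, r2, r3 = rho  # r1: conditions (1,2), r2: conditions (1,3), r3: conditions (2,3)
    var = (s1**2 + 4 * s2**2 + s3**2
           - 4 * s1 * s2 * r1 - 4 * s2 * s3 * r3 + 2 * s1 * s3 * r2)
    return math.sqrt(max(var, 0.0))


def normal_pdf(z, sd):
    """Density of a zero-mean normal distribution with standard deviation sd."""
    return math.exp(-0.5 * (z / sd) ** 2) / (math.sqrt(2.0 * math.pi) * sd)


def evidential_value(z, n, sigma, steps=41):
    """Grid-search approximation of V for a single experiment (illustrative only)."""
    s0 = sigma_z(sigma, (0.0, 0.0, 0.0))
    f0 = normal_pdf(z, s0 / math.sqrt(n))
    # Grid on the open interval (-1, 1); an odd number of steps includes rho = 0.
    grid = [-1.0 + 2.0 * (k + 1) / (steps + 1) for k in range(steps)]
    best = f0  # rho = 0 is always feasible, hence V >= 1
    for r1 in grid:
        for r2 in grid:
            for r3 in grid:
                # Positive definiteness of the correlation matrix
                if r1 * r1 + r2 * r2 + r3 * r3 - 2.0 * r1 * r2 * r3 >= 1.0:
                    continue
                sz = sigma_z(sigma, (r1, r2, r3))
                # Only variance-reducing correlations are feasible
                if sz <= 0.0 or sz > s0:
                    continue
                best = max(best, normal_pdf(z, sz / math.sqrt(n)))
    return best / f0


if __name__ == "__main__":
    print(evidential_value(z=0.05, n=20, sigma=(1.0, 1.0, 1.0)))
```

A grid search can only underestimate the true maximum, so the returned value is a lower approximation
of $V$; a finer grid or a constrained optimizer would tighten it.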


2 An asymptotic test statistic 


Without any knowledge of the test statistic, i.e. the distribution of $V$ under the null hypothesis
$H_0$ (independent group means), it is not possible to interpret the value $V$ and hence to decide
whether a certain value of $V$ provides evidence for the presence of sample correlations.
The estimates of the sample variances ($s^2$) are themselves random variables with some unknown
distribution. It is therefore rather unlikely that a closed-form expression for the test statistic
can be obtained, even under restrictive assumptions about the distribution of $s$.

Nevertheless, as proposed by Klaassen [11], one may assume that asymptotically $s \to \sigma$ as
$n \to \infty$. Then one can assume that the sample variances $\sigma^2$ are fixed and known, allowing
for the construction of an upper-bound asymptotic test statistic.

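The likelihood to obtain a specific value $z$, given the sample variances $\vec{\sigma}^2$ and
correlations $\vec{\rho}$, is

$$
f(Z = z \mid \sigma_Z(\vec{\sigma}, \vec{\rho})) =
\frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_Z(\vec{\sigma}, \vec{\rho})}
\exp\left\{ -\frac{n z^2}{2\sigma_Z^2(\vec{\sigma}, \vec{\rho})} \right\},
$$

and therefore

$$
V = \max_{\vec{\rho} \in \mathcal{F}}
\frac{\sigma_Z(\vec{\sigma}, \vec{0})}{\sigma_Z(\vec{\sigma}, \vec{\rho})}
\exp\left\{ -\frac{n z^2}{2\sigma_Z^2(\vec{\sigma}, \vec{\rho})}
            + \frac{n z^2}{2\sigma_Z^2(\vec{\sigma}, \vec{0})} \right\}. \tag{2}
$$

Now, let $a = \sigma_Z(\vec{\sigma}, \vec{\rho}) / \sigma_Z(\vec{\sigma}, \vec{0})$ be the relative standard
deviation and $\sigma_0 = \sigma_Z(\vec{\sigma}, \vec{0})$, then

$$
V = \max_{a \in \mathcal{A}} a^{-1}
\exp\left\{ -\frac{n z^2}{2 a^2 \sigma_0^2} + \frac{n z^2}{2\sigma_0^2} \right\}.
$$

The feasible set $\mathcal{A}$ of all $a$ values is implicitly defined by the feasible set of
correlations as

$$
\mathcal{A} = \left\{ \frac{\sigma_Z(\vec{\sigma}, \vec{\rho})}{\sigma_Z(\vec{\sigma}, \vec{0})} :
\vec{\rho} \in \mathcal{F} \right\}.
$$

From this it follows immediately that $\mathcal{A} \subseteq (0, 1]$, as
$\sigma_Z(\vec{\sigma}, \vec{\rho}) \le \sigma_Z(\vec{\sigma}, \vec{0})$ for all $\vec{\rho} \in \mathcal{F}$.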

Under a worst-case scenario, one may assume $\mathcal{A} = (0, 1]$. This implies that for every
$a \in (0, 1]$ it is possible to find a feasible correlation vector $\vec{\rho} \in \mathcal{F}$ such that
$\sigma_Z(\vec{\sigma}, \vec{\rho}) = a\,\sigma_0$. Please note that this is not ensured in general. The
worst-case assumption, however, allows one to obtain upper bounds for the distribution of $V$ under
$H_0$ analytically by relaxing the constraints on $a$ implied by the feasibility constraints on
$\vec{\rho}$.

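Within this setting one gets

$$
V \le \bar{V} = \max_{a \in (0, 1]} a^{-1}
\exp\left\{ -\frac{n z^2}{2 a^2 \sigma_0^2} + \frac{n z^2}{2\sigma_0^2} \right\}.
$$

With $\tilde{z} = \sqrt{n}\, z / \sigma_0$, the normalized $z$ with respect to the expected standard
deviation under $H_0$,

$$
\bar{V} = \max_{a \in (0, 1]} a^{-1}
\exp\left\{ -\frac{\tilde{z}^2}{2 a^2} + \frac{\tilde{z}^2}{2} \right\}.
$$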

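Straightforward computation reveals

$$
0 = \partial_a \log\left( a^{-1}
\exp\left\{ -\frac{\tilde{z}^2}{2 a^2} + \frac{\tilde{z}^2}{2} \right\} \right)
= -\frac{1}{a} + \frac{\tilde{z}^2}{a^3}
\quad\Rightarrow\quad a^2 = \tilde{z}^2 ,
$$

and therefore

$$
\bar{V} = \begin{cases}
|\tilde{z}|^{-1} \exp\left\{ -\frac{1}{2} + \frac{\tilde{z}^2}{2} \right\} & \text{if } |\tilde{z}| < 1, \\
1 & \text{else.}
\end{cases} \tag{3}
$$

Under the worst-case scenario, an upper-bound evidential value $\bar{V} \ge V$ can be computed
directly without maximizing the likelihood ratio numerically. This result was also found by
Klaassen (compare eq. 18 in [11]).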
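
The closed-form bound is easy to verify against a brute-force maximization over $a \in (0, 1]$. The
Python sketch below is a minimal illustration under the worst-case assumption $\mathcal{A} = (0, 1]$;
the function names `v_upper` and `v_upper_numeric` are hypothetical.

```python
import math


def v_upper(z_tilde):
    """Closed-form upper bound on V as a function of the normalized deviation."""
    a = abs(z_tilde)
    if a >= 1.0:
        return 1.0          # maximum attained at the boundary a = 1
    if a == 0.0:
        return math.inf     # the bound diverges for a perfectly linear result
    return math.exp(-0.5 + 0.5 * a * a) / a


def v_upper_numeric(z_tilde, steps=100000):
    """Brute-force maximum of a^-1 * exp(-z^2/(2 a^2) + z^2/2) over a grid on (0, 1]."""
    best = 0.0
    for k in range(1, steps + 1):
        a = k / steps
        value = math.exp(-z_tilde**2 / (2.0 * a * a) + z_tilde**2 / 2.0) / a
        best = max(best, value)
    return best


if __name__ == "__main__":
    for z in (0.1, 0.3, 0.9, 1.5):
        print(z, v_upper(z), v_upper_numeric(z))
```

Both functions agree up to the grid resolution, and both return 1 as soon as $|\tilde{z}| \ge 1$.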

Knowing that the maximum $\bar{V}$ is achieved at $\tilde{z}^2 = a^2$ and therefore
$n z^2 / \sigma_0^2 = \sigma_Z^2(\vec{\sigma}, \vec{\rho}) / \sigma_Z^2(\vec{\sigma}, \vec{0})$, one
may conclude that the likelihood-ratio test compares the expected variance $\sigma_0^2$ under $H_0$ with
a variance estimated from a single sample. Such a variance estimate is known to be unreliable
and therefore the evidential value for a single experiment must be unreliable, too. This issue is
discussed in detail in the next section.


3 Testing multiple experiments 

Klaassen m, see also m suggested to obtain the evidential value V for an article consisting of 
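more than one experiment as the product of the evidential values $V_j$ of the single experiments
in the article. The evidential value $V$ of a publication given $N$ experiments is then

$$
V = \prod_{j=1}^{N} V_j = \prod_{j=1}^{N}
\frac{\max_{\vec{\rho} \in \mathcal{F}_j} f(z_j \mid \sigma_Z(\vec{\sigma}_j, \vec{\rho}))}
     {f(z_j \mid \sigma_Z(\vec{\sigma}_j, \vec{0}))}. \tag{4}
$$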


Given that $V_j \ge 1$, this immediately implies that the product grows exponentially with the
number of experiments even if $H_0$ is true. Instead of obtaining the evidential value for every
single experiment in an article, which (in a worst-case scenario) is based on a variance estimator
from a single sample ($\sigma_{Z_j}^2 = n_j z_j^2$), one may try to base that variance estimation on the
$N$ samples provided by the $N$ experiments in an article, i.e.

$$
V = \max_{\vec{\rho} \in \mathcal{F}} \prod_{j=1}^{N}
\frac{f(z_j \mid \sigma_Z(\vec{\sigma}_j, \vec{\rho}))}
     {f(z_j \mid \sigma_Z(\vec{\sigma}_j, \vec{0}))}, \tag{5}
$$

where the feasible set $\mathcal{F} = \bigcap_{j=1}^{N} \mathcal{F}_j$ is just the intersection of the
feasible sets $\mathcal{F}_j$ of all experiments.

The idea of this alternative approach is simple: we cannot make a reliable statement about
the probability of observing a single suspiciously small $z_j$, particularly as $0 = E[Z]$ under $H_0$.
However, observing a suspiciously small $z$ repeatedly is unlikely and may indicate sample
correlations between groups.
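
The growth of the product of single-experiment evidential values under $H_0$ can be illustrated by a
small simulation. The sketch below is an assumption-laden illustration, not a reproduction of any
reported analysis: it uses the worst-case upper bound $\bar{V}_j$ in place of $V_j$, draws the
normalized deviations $\tilde{z}_j$ from a standard normal distribution, and uses hypothetical function
names and an arbitrary random seed.

```python
import math
import random


def v_upper(z_tilde):
    """Worst-case upper bound on the evidential value of one experiment."""
    a = abs(z_tilde)
    if a >= 1.0:
        return 1.0
    if a == 0.0:
        return math.inf
    return math.exp(-0.5 + 0.5 * a * a) / a


def mean_log_product(n_experiments, repetitions=2000, seed=0):
    """Average of log(prod_j V_j) over simulated articles with independent group means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repetitions):
        total += sum(math.log(v_upper(rng.gauss(0.0, 1.0)))
                     for _ in range(n_experiments))
    return total / repetitions


if __name__ == "__main__":
    for n_exp in (5, 10, 20):
        print(n_exp, mean_log_product(n_exp))
```

Because every factor is at least 1, the logarithm of the product can only increase with each added
experiment; in the simulation its average grows roughly in proportion to $N$, which corresponds to
exponential growth of the product itself.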

Following the worst-case scenario above, the joint evidential value for $N$ experiments is
asymptotically

$$
\bar{V} = \max_{a \in (0, 1]} a^{-N}
\exp\left\{ -\sum_{j=1}^{N} \frac{n_j z_j^2}{2 a^2 \sigma_{0,j}^2}
            + \sum_{j=1}^{N} \frac{n_j z_j^2}{2 \sigma_{0,j}^2} \right\}
= \max_{a \in (0, 1]} a^{-N}
\exp\left\{ -\sum_{j=1}^{N} \frac{\tilde{z}_j^2}{2 a^2}
            + \sum_{j=1}^{N} \frac{\tilde{z}_j^2}{2} \right\},
$$

where again $\tilde{z}_j = \sqrt{n_j}\, z_j / \sigma_{0,j}$ and $\sigma_{0,j} = \sigma_Z(\vec{\sigma}_j, \vec{0})$.
A straightforward computation reveals the surprisingly familiar result

$$
a^2 = \frac{1}{N} \sum_{j=1}^{N} \tilde{z}_j^2 .
$$


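This implies that, in a worst-case scenario, the joint likelihood ratio compares a variance
estimate based on $N$ samples with the expected one. And finally

$$
\bar{V} = \begin{cases}
\left( \frac{1}{N} \sum_{j=1}^{N} \tilde{z}_j^2 \right)^{-N/2}
\exp\left\{ -\frac{N}{2} + \frac{1}{2} \sum_{j=1}^{N} \tilde{z}_j^2 \right\}
& \text{if } \frac{1}{N} \sum_{j=1}^{N} \tilde{z}_j^2 < 1, \\
1 & \text{else.}
\end{cases}
$$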


Note that the joint evidential value for $N$ experiments relies on the fact that
$\tilde{z}_j \sim \mathcal{N}(0, 1)$ i.i.d. under $H_0$ and therefore $\sum_{j=1}^{N} \tilde{z}_j^2 \sim \chi^2_N$.
Hence the test statistic for sample correlations between groups can be expressed as a simple
chi-squared statistic and one does not need to make the detour of obtaining an approximate
distribution of $V$ under $H_0$.
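
A minimal sketch of this chi-squared test in Python is given below. It assumes that the relevant
tail is the lower one (suspiciously small deviations from linearity) and that the normalized
deviations $\tilde{z}_j$ have already been computed; the function names are hypothetical. To stay
within the standard library, the chi-squared distribution function is evaluated through the series
expansion of the regularized lower incomplete gamma function.

```python
import math


def chi2_cdf(x, dof):
    """P(X <= x) for X ~ chi-squared with `dof` degrees of freedom (series expansion)."""
    if x <= 0.0:
        return 0.0
    a, t = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    # gamma(a, t) = t^a * exp(-t) * sum_k t^k / (a * (a+1) * ... * (a+k))
    while abs(term) > 1e-16 * abs(total):
        term *= t / (a + k)
        total += term
        k += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


def too_good_p_value(z_tilde):
    """Lower-tail p-value of sum_j z_tilde_j^2 under H0 (chi-squared with N dof)."""
    statistic = sum(z * z for z in z_tilde)
    return chi2_cdf(statistic, len(z_tilde))


if __name__ == "__main__":
    # Hypothetical normalized deviations for an article with 10 experiments
    print(too_good_p_value([0.1] * 10))
    print(too_good_p_value([1.0] * 10))
```

A small p-value indicates that the deviations from linearity are jointly smaller than expected for
independent group means.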


4 Relation to the $\Delta F$ test

The $\chi^2$-test derived in the last section is closely related to the $\Delta F$-test suggested by the
whistleblower pp. This test was also included in the report for the University of Amsterdam [17].

Under $H_0$ and the assumption of a linear trend, the p-values of the $\Delta F$-test for a single
experiment within an article are distributed uniformly in [0,1]. Using Fisher's method, it is then
possible to obtain a p-value for an article comprising several experiments. The major difference
between these two methods is that the $\Delta F$-test first determines a p-value for every study and tests
whether the resulting p-values $p_j$ are too good to be true, while the chi-square test introduced here
assesses this directly by inspecting whether the relative deviations from perfect linearity
$\tilde{z}_j$ are too good to be true. Therefore, unsurprisingly, the two methods yield very similar results
(see Table 1).


Article                                          $\chi^2$-test   $\Delta F$-test   Classification
JF09.JEPG [3]                                    8.06e-07        2.30e-07          strong
JF11.JEPG H3                                     8.73e-07        3.53e-07          strong
JF.D12.SPPS [Ej                                  7.14e-09        1.82e-08          strong
L.JF09.JPSP HU                                   6.44e-4         8.46e-5           strong
L.JF09.JPSP*                                     0.03            0.02              -
JF.LS09.JEPG @]                                  0.25            0.11              strong
JF.LK08.JPSP 0                                   0.81            0.66              inconclusive
D.JF.L09.JESP 0                                  0.93            0.52              inconclusive
Reference [8, 9, 10, 12, 15, 18, 20, 22, 21]     0.11            0.14

Tab. 1: Comparison of p-values obtained with the direct $\chi^2$ and $\Delta F$ tests for studies classified
as providing strong or inconclusive statistical evidence for low veracity by Peeters et al.
[17]. The first three studies listed in the table were reported by the whistleblower [GQ.
Note the divergence for JF.LS09.JEPG between the present analysis and [17]. Only those
studies from [17] were considered here which provide at least 8 experiments.
[Figure 2 panels: JF.LK08.JPSP: p=0.81094855; D.JF.L09.JESP: p=0.89804776; Reference: p=0.11149939]

Fig. 2: The distribution of $\tilde{z}_j$ (short dashes at the bottom of each panel) for each experiment from
the articles listed in Table 1. The solid line shows the expected distribution of $\tilde{z}_j$ under
$H_0$ while the dashed line shows the normal distribution with zero mean and the variance
estimated from the samples $\tilde{z}_j$.


Both methods, the $\chi^2$ and $\Delta F$ tests, are conservative compared to the V-value approach
by Klaassen [11]. For example, the article JF.LS09.JEPG in Table 1 was classified with strong
statistical evidence for low veracity [17] (compare also Figure 3). In contrast, the $\chi^2$ and $\Delta F$
methods yield p-values of $\approx 0.25$ and $\approx 0.11$, respectively, suggesting that there is no evidence
of sample correlations between groups. The three methods agree for the studies JF.LK08.JPSP
and D.JF.L09.JESP, which were classified with inconclusive statistical evidence for low veracity.
The three methods also agree on classifying the three articles reported by the whistleblower pQ
with strong statistical evidence for low veracity.

Depending on the chosen level of significance, the article L.JF09.JPSP could be classified as
strong or inconclusive. This article contains conditions for which the authors did not expect a
specific rank ordering of the condition means. Peeters et al. [17] included these control conditions
but reordered them according to increasing group means, yielding a p-value for the $\chi^2$-test of
about 0.0006 (L.JF09.JPSP in Table 1). Although the assumption of equidistant group means,
i.e. $0 = \mu_1 - 2\mu_2 + \mu_3$, contains the assumption of equal group means, i.e. $\mu_1 = \mu_2 = \mu_3$, as a
special case, the actual test result depends on the ordering of the conditions. Keeping the order
of conditions as reported in m yields a p-value of about 0.015 and excluding them results in a
p-value of about 0.03, shown as L.JF09.JPSP* in Table 1.

The discrepancy between the $\chi^2$ or $\Delta F$ methods and the V-value method for the JF.LS09.JEPG
article [Tj is due to the tendency of the V-value method to indicate strong evidence if a single
experiment out of a series of experiments has a very small $\tilde{z}$-value. In contrast to the V-value
method, the $\chi^2$ and the augmented V-method (see Section 3) take all experiments of an article
into account by assuming the same correlation structure for all experiments.

For the particular article [1], the V-value approach reported strong evidence for low veracity
because the last two experiments (compare Figure 3) exhibit the super-linear pattern associated
with sample correlations. The $\chi^2$ and $\Delta F$ methods, however, do not indicate significant sample
correlations, as the deviances of the remaining experiments fit well into the expected distribution
under $H_0$, especially the results in panels 5 & 6 in Figure 3.

Fig. 3: Condition means and standard deviations for 9 experiments from |¥j.

Klaassen [11] intended the V-value to be sensitive for single experiments. The argument is
that bad science cannot be compensated by very good science [11]. Finding a small value for
$\tilde{z}_j$ in a series of experiments, however, is quite probable² even under $H_0$. Hence one could argue
that a single suspiciously small $\tilde{z}_j$ cannot be interpreted as strong evidence for sample correlations.
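
The probabilities quoted in footnote 2 follow from a simple calculation, sketched here in Python under
the assumption of independent experiments: if a single experiment falls below a "suspiciously small"
threshold with probability $\alpha$ under $H_0$, the probability that at least one of $N$ experiments does
so is $1 - (1 - \alpha)^N$. The function name is hypothetical.

```python
def prob_at_least_one(alpha, n_experiments):
    """Probability that at least one of n independent experiments falls below the threshold."""
    return 1.0 - (1.0 - alpha) ** n_experiments


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(alpha, round(prob_at_least_one(alpha, 10), 3))
```

For $N = 10$ this gives about 0.40 for $\alpha = 0.05$ and about 0.10 for $\alpha = 0.01$.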


5 Discussion 

There is no doubt that, in principle, statistics can be used to detect sample correlations that are
due to data manipulation. The approach proposed in [11], however, is not without problems.

A first problem is the missing test statistic for the evidential value $V$. Although an
upper-bound asymptotic test statistic for the V-value of a single experiment can be obtained (see
Section 2 above and [11]), the reliability of the V-value for a small $n$ remains unknown (as does
how large $n$ must be to be considered large).

A second problem is the critical value of $V^* = 6$ chosen by the authors, which implies
(asymptotically) $p \approx 0.08$. Arguably, this is a rather high probability of falsely accusing a
colleague of data manipulation.

A third problem is the assumption that the product of the evidence provided by every single
experiment in an article can serve as a metric of evidence for data manipulation in this article. As
mentioned above, as well as in the comments to the article at pubpeer.com m and in a response
by Denzler and Liberman [14], this assumption implies that the evidence for data manipulation
grows exponentially with the number of experiments even under $H_0$. The probability of $V > 2$
for a single experiment is about $p \approx 0.25$. Thus, about every 4th good experiment will double
the evidence for data manipulation.
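
The two probabilities quoted above ($p \approx 0.08$ for $V^* = 6$ and $p \approx 0.25$ for $V > 2$) can be
reproduced from the worst-case bound $\bar{V}$ of Section 2. The Python sketch below assumes that these
figures refer to $\bar{V}$ with $\tilde{z} \sim \mathcal{N}(0, 1)$ under $H_0$; since $V \le \bar{V}$, the
results are upper bounds on the corresponding probabilities for $V$. The function names are
hypothetical.

```python
import math


def log_v_upper(z_tilde):
    """Logarithm of the worst-case bound on V for a normalized deviation z_tilde."""
    a = abs(z_tilde)
    return 0.0 if a >= 1.0 else -math.log(a) - 0.5 + 0.5 * a * a


def prob_v_upper_at_least(v, tol=1e-12):
    """P(V_bar >= v) under H0 for a threshold v > 1."""
    target = math.log(v)
    lo, hi = 1e-300, 1.0  # log_v_upper decreases monotonically on (0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_v_upper(mid) > target:
            lo = mid
        else:
            hi = mid
    z_star = 0.5 * (lo + hi)
    # V_bar >= v exactly when |z_tilde| <= z_star
    return math.erf(z_star / math.sqrt(2.0))


if __name__ == "__main__":
    print(round(prob_v_upper_at_least(6.0), 3))
    print(round(prob_v_upper_at_least(2.0), 3))
```

The bisection finds the critical normalized deviation below which the bound exceeds the threshold;
the probability is then the standard normal mass of that interval.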

The fourth problem, finally, is a general concern. The analysis assumes a specific type of
data manipulation. If this is true, the manipulation will induce correlations between condition
means. Moreover, under the second assumption that $0 = E[X_1 - 2X_2 + X_3]$, this correlation can
be detected. Importantly, however, the reverse is not true: the detection of such correlations
in the data does not necessarily imply that data were manipulated. For that reason, Peeters
et al. [17] carefully avoided claiming that their findings prove that data were manipulated.
Instead, the results are interpreted as evidence for low data veracity, which is justified. In [11],
however, Klaassen claims that the method provides evidence for manipulation. Although the
origin of sample correlations cannot be determined with statistics, their presence certainly
violates an ANOVA assumption. This may result in an increased type-I error rate. Therefore, the
effects reported in the articles providing strong or possibly even inconclusive evidence for sample
correlations (e.g. [HI 151151H3]) may be less significant than suggested by their ANOVAs.

² I.e. for 10 experiments ($N = 10$), $p \approx 0.4$ for $\alpha = 0.05$ and $p \approx 0.1$ for $\alpha = 0.01$.

In this comment, specifically in Section 3, the concept of the single-experiment evidential
value was extended to multiple experiments. Moreover, a much simpler chi-squared test was
provided to test the presence of correlations in the data that is similar to the test proposed in p"
and yielded very similar probabilities for the presence of sample correlations. Thus, the V-value
approach can serve as a test for sample correlations if it is applied across several identical or at
least similar experiments. In this case one is also able to decide whether the variability in the
results is suspiciously small or not. However, estimating $\sigma_Z$ on the basis of a single experiment
will certainly not yield a reliable result.